Expanding textual entailment corpora from Wikipedia using co-training

Authors

  • Fabio Massimo Zanzotto
  • Marco Pennacchiotti
Abstract

In this paper we propose a novel method to automatically extract large textual entailment datasets that are homogeneous to existing ones. The key idea is the combination of two intuitions: (1) the use of Wikipedia to extract a large set of textual entailment pairs; (2) the application of semi-supervised machine learning methods to make the extracted dataset homogeneous to the existing ones. We report empirical evidence that our method successfully expands existing textual entailment corpora.
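
As a rough illustration of the semi-supervised idea named in the title, the sketch below shows a generic co-training loop. It is not the authors' implementation: the two single-number feature views (a hypothetical lexical-overlap score and a hypothetical length-ratio score), the nearest-centroid classifier, the toy data, and the names `train_centroids`, `predict`, and `co_train` are all assumptions made for this example. Each view trains its own classifier on the labeled seed set, labels the pool candidate it is most confident about, and adds it to the shared training set.

```python
from statistics import mean

def train_centroids(examples, view):
    """Fit a nearest-centroid classifier on one feature view.

    examples: list of (views, label); views is a tuple of feature tuples.
    Returns a dict mapping each label to its centroid vector.
    """
    by_label = {}
    for views, label in examples:
        by_label.setdefault(label, []).append(views[view])
    return {lab: [mean(col) for col in zip(*vecs)]
            for lab, vecs in by_label.items()}

def predict(centroids, vec):
    """Return (label, confidence) for one feature vector.

    Confidence is the gap between the squared distances to the
    two nearest centroids (0.0 if only one label is known).
    """
    dists = sorted((sum((a - b) ** 2 for a, b in zip(vec, c)), lab)
                   for lab, c in centroids.items())
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else 0.0
    return dists[0][1], margin

def co_train(labeled, unlabeled, rounds=5, per_round=1):
    """Grow the labeled set by letting each view label its most confident candidates."""
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        for view in (0, 1):
            if not pool:
                return labeled
            model = train_centroids(labeled, view)
            # Rank pool items by this view's confidence, most confident first.
            ranked = sorted(pool, key=lambda v: predict(model, v[view])[1],
                            reverse=True)
            for views in ranked[:per_round]:
                labeled.append((views, predict(model, views[view])[0]))
                pool.remove(views)
    return labeled

# Toy data: each item is (view 0 features, view 1 features); values are made up.
seed = [(((0.9,), (0.8,)), "entailment"),
        (((0.1,), (0.2,)), "no_entailment")]
candidates = [((0.85,), (0.9,)), ((0.2,), (0.1,)),
              ((0.7,), (0.75,)), ((0.15,), (0.3,))]

expanded = co_train(seed, candidates)
labels = {views: label for views, label in expanded}
```

In this toy run the seed set of two pairs grows to six labeled pairs. In a real setting the candidate pool would come from an external source such as Wikipedia, and the confidence threshold and per-round quota would need tuning.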

Related resources

Detecting Cross-Lingual Semantic Divergence for Neural Machine Translation

Parallel corpora are often not as parallel as one might assume: non-literal translations and noisy translations abound, even in curated corpora routinely used for training and evaluation. We use a cross-lingual textual entailment system to distinguish sentence pairs that are parallel in meaning from those that are not, and show that filtering out divergent examples from training improves transl...

The Description of the NTOU RITE System in NTCIR-9

The textual entailment system determines whether one sentence can entail another in a common sense. We proposed several approaches to train textual entailment classifiers, including setting ancestor distance threshold, expanding training corpus, using different sets of features, and tuning classifier settings. The results show that a MC classifier trained by using an expanded training corpus an...

Automatic Building and Using Parallel Resources for SMT from Comparable Corpora

Building parallel resources for corpus-based machine translation, especially Statistical Machine Translation (SMT), from comparable corpora has recently received wide attention in the field of Machine Translation research. In this paper, we propose an automatic approach for extraction of parallel fragments from comparable corpora. The comparable corpora are collected from Wikipedia documents and t...

A Preliminary Study of Finding Entailing Texts in a Domain-specific Monolingual Parallel Corpora

This paper introduces the possible usages, benefits, and challenges involved in the use of domain-specific monolingual parallel corpora in determining textual entailment (TE). A system that finds entailing text for a given statement is to be developed using monolingual parallel translations of the Bible as corpus as this is one of the most accessible monolingual parallel corpora. Different exis...

Recognizing Paraphrases And Textual Entailment Using Inversion Transduction Grammars

We present first results using paraphrase as well as textual entailment data to test the language universal constraint posited by Wu’s (1995, 1997) Inversion Transduction Grammar (ITG) hypothesis. In machine translation and alignment, the ITG Hypothesis provides a strong inductive bias, and has been shown empirically across numerous language pairs and corpora to yield both efficiency and accura...

Publication date: 2010